Discovering Editing Rules For Data Cleaning

Authors

  • Thierno Diallo
  • Jean-Marc Petit
  • Sylvie Servigne
Abstract

Dirty data continues to be an important issue for companies, and the database community pays particular attention to this subject. A variety of integrity constraints, such as Conditional Functional Dependencies (CFD), have been studied for data cleaning. Data repair methods based on these constraints are good at detecting inconsistencies but offer limited guidance on how to correct data; worse, they can even introduce new errors. Based on Master Data Management principles, a new class of data quality rules known as Editing Rules (eR) tells how to fix errors, pointing out which attributes are wrong and what values they should take. However, finding data quality rules is an expensive process that involves intensive manual effort. In this paper, we develop pattern mining techniques for discovering eRs from existing source relations (possibly dirty) with respect to master relations (supposed to be clean and accurate). In this setting, we propose a new semantics of eRs that takes advantage of both source and master data. The problem turns out to be strongly related to the discovery of both CFDs and one-to-one correspondences between source and target attributes. We propose efficient techniques to address the discovery problem of eRs, together with heuristics to clean data. We have implemented and evaluated our techniques on real-life databases. Experiments show the feasibility, scalability, and robustness of our proposal.
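To make the idea of an editing rule concrete, here is a minimal sketch of how one rule might be applied to a source tuple using master data. It only illustrates the behaviour described in the abstract (a rule names the attribute to fix and takes its value from the master relation). It is not the discovery algorithm of the paper, and the names `EditingRule` and `apply_rule`, the rule layout, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EditingRule:
    """Hypothetical, simplified editing rule (not the paper's formal definition)."""
    match: list                                   # (source_attr, master_attr) pairs that must agree
    fix: tuple                                    # (source_attr, master_attr): copy master value into source
    pattern: dict = field(default_factory=dict)   # conditions the source tuple must satisfy


def apply_rule(rule, source_tuple, master):
    """Return a corrected copy of source_tuple; unchanged if the rule does not apply."""
    fixed = dict(source_tuple)
    # The rule only fires on source tuples that satisfy its pattern.
    if any(source_tuple.get(a) != v for a, v in rule.pattern.items()):
        return fixed
    # Collect the values proposed by master tuples agreeing on all match attributes.
    proposed = {
        s.get(rule.fix[1])
        for s in master
        if all(source_tuple.get(x) == s.get(xm) for x, xm in rule.match)
    }
    # Apply the fix only when the master data proposes exactly one value.
    if len(proposed) == 1:
        fixed[rule.fix[0]] = proposed.pop()
    return fixed


# Hypothetical master relation (assumed clean) and a dirty source tuple.
master = [
    {"zip": "69100", "city": "Villeurbanne"},
    {"zip": "69001", "city": "Lyon"},
]
dirty = {"zip": "69100", "city": "Lyon", "country": "FR"}

rule = EditingRule(match=[("zip", "zip")], fix=("city", "city"), pattern={"country": "FR"})
cleaned = apply_rule(rule, dirty, master)
```

In this sketch the rule assumes `zip` is already known to be correct in the source tuple, so `city` is overwritten with the value from the matching master tuple. A tuple with no match, or with several conflicting matches, is left untouched rather than guessed at.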


Similar resources

Editing Rules: Discovery and Application to Data Cleaning

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. A variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Data repairing methods based on these constraints are strong to detect inconsistencies but are limited on how to corre...


Discovering Data Quality Rules in a Master Data Management

Dirty data continues to be an important issue for companies. The Data Warehousing Institute [Eckerson, 2002], [Rockwell, 2012] stated that poor data costs US businesses $611 billion annually and that erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality becomes more and more critical. The database community pays particular attention to this subject whe...


Règles d’Edition: Fouille et Application au Nettoyage de Données

Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. A variety of integrity constraints like Conditional Functional Dependencies (CFD) have been studied for data cleaning. Data repairing methods based on these constraints are strong to detect inconsistencies but are limited on how to corre...


CerFix: A System for Cleaning Data with Certain Fixes

We present CerFix, a data cleaning system that finds certain fixes for tuples at the point of data entry, i.e., fixes that are guaranteed correct. It is based on master data, editing rules and certain regions. Given some attributes of an input tuple that are validated (assured correct), editing rules tell us what other attributes to fix and how to correct them with master data. A certain region...


Detecting Inconsistencies in Private Data with Secure Function Evaluation

Erroneous and inconsistent data, often referred to as ‘dirty data’, is a major worry for businesses. Prevalent techniques to improve data quality consist of discovering data quality rules, identifying records that violate those rules, and then modifying the data to remove those violations. Most of the work described in the literature deals with cases where both the data and the rules are...




Publication date: 2012